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Cu , The decomposition of a sample of images on a relevant subspace 

is a recurrent problem in many different fields from Computer Vision 
to medical image analysis. We propose in this paper a new learning 

>0 . principle and implementation of the generative decomposition model 

generally known as noisy ICA (for independent component analy- 
sis) based on the SAEM algorithm, which is a versatile stochastic 

Q, ■ approximation of the standard EM algorithm. We demonstrate the 

applicability of the method on a large range of decomposition models 
and illustrate the developments with experimental results on various 
data sets. 
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1. Introduction. Independent Component Analysis (ICA) is a statistical 
technique that aims at representing a data set of random vectors as linear 
K^ ■ combinations of a fixed family of vectors with statistically independent co- 

efficients. It was initially designed to solve source separation problems in 
f~^ I acoustic signals [Bremond, Moulines and Cardoso (1997)] and rapidly found 

cn ■ a large range of applications, in particular, in medical image analysis [Cal- 

cn ! houn et al. (2001), Calhoun, Adah and McGinty (2001)], where ICA has 

^^ [ become one of the standard approaches. And because it is often valuable 

to decompose a large set of variables into simple components, ICA applies 
more generally as well [in computer vision Bartlett, Movellan and Sejnowski 
(2002), Beh and Sejnowski (1995a), Farid and Adelson (1999), Liu and Wech- 
^ . sler (2003); and in computational biology Liebermeister (2002), Makeig and 

C^ ; Jung (1997), Scholz et al. (2004), etc.]. 

Often in such problems, the data are high dimensional but have small to 
moderate sample size, which complicates statistical analysis. For example, 
one challenge in medical imaging is to extract significant information from 
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spatially varying anatomical or functional signals drawn from a relatively 
small number of individuals. A common way to address this issue is to 
apply dimension-reduction techniques to reduce the information to a smaller 
number of highly informative statistics. ICA can be used for this purpose, 
and, in many cases, the representations it provides are qualitatively very 
different from those obtained using decorrelation methods such as principal 
components analysis (PCA) [Uziimcii et al. (2003)]. 

ICA can be formulated in terms of a generative model that approximates 
the distribution of the data, allowing well-understood statistical methods 
to be used for training and validation. ICA represents an observed d-di- 
mensional random variable X as 

d 

(1.1) X = ^/3^a,-, 

where (ai, . . . , a^^) € M are parameters (called decomposition vectors) and 
/3^, . . . ,/3'^ are independent scalar random variables drawn from a specified 
distribution (or family of distributions). One product of ICA is an estimate 
of the decomposition matrix A = (ai,...,arf) based on i.i.d. observations 
(Xi, . . . , X„). With model (1.1), the independent components /3^, . . . , /J'^ can 
be computed from X by inverting A. A variety of methods and criteria have 
been proposed to estimate either A or W = A~^ (see http://www.tsi. 
enst .fr/icacentral/index.html, from which some algorithms may be ac- 
cessed). For example, in Arie (2002), A is seen as a joint diagonalizer of a set 
of estimated correlation matrices. In Bell and Sejnowski (1995b) and Eriks- 
son, Karvanen and Koivunen (2000), standard estimation procedures, like 
maximum entropy or minimum Kullback-Leibler divergence, are used with 
specified distributions for the independent components. 

We will also use a model-based formulation in this paper, but it is impor- 
tant to mention that a large class of algorithms have also been defined for 
distribution-free representations (based on the so-called negentropy — non- 
Gaussian entropy — and cumulant expansion) , including the widely used Fas- 
tlCA method [Hyvarinen and Oja (1997)], as well as algorithms proposed in 
Learned-Miller et al. (2003) or Bach and Jordan (2003), which maximize the 
independence (with respect to some criteria) of the components using the 
semi-parametric model of ICA. A comparison study has been made in Car- 
doso (1999), using high-order measures to assess component independence. 

One of the drawbacks of ICA is that it does not come (like PCA does) 
with a well-defined method to select the most important components. In the 
original formulation, the number of independent components is equal to the 
dimension of the variables, so that the decomposition is achieved without 
dimensional reduction. This leads to computational and overfitting issues 
when dealing with high-dimensional data and small sample sizes, and a lack 
of interpretability of the obtained results. 
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Probabilistic ICA (alternatively called noisy ICA, or independent factor 
analysis, although we will reserve the latter term to a more specific method 
in which factors are Gaussian mixtures) assumes a small number of indepen- 
dent components, with a residual term which is modeled as Gaussian noise. 
The explicit model is therefore given by 

p 
(1.2) X = ^(3^aj + ae, 

i=i 

where (ai, . . . ,SLp) € M.'^^'p now represent d x p parameters (to be compared 
to the d X d matrix. A, estimated in the standard ICA model), (3^, . . . ,13^ 
are independent scalar random variables and e, the noise, follows a standard 
normal distribution (we will take the standard deviation, o", to be a fixed 
scalar, also a parameter). Such models therefore represent the d-dimensional 
input vector, X, by p scalar components, achieving the required dimensional 
reduction. 

The ICA training algorithms (e.g., estimating W) do not generalize to 
probabilistic ICA. In particular, the d-dimensional vector X is modeled as 
a function of the {p + d) -dimensional variable (/3, e) and we have partial ob- 
servations. A possible approach is to first implement some dimension reduc- 
tion to the data, typically projecting X on the p first principal components 
to eliminate the residual, before applying standard ICA to the projection 
[Come et al. (2008), Varoquaux et al. (2010)]. But this procedure does not 
necessarily retrieve the model described in (1.2) (especially when the noise 
has a large variance), and training probabilistic ICA in a way which is con- 
sistent with this statistical model certainly is a more satisfactory approach. 

The numerical method described in this paper estimates the maximum 
likelihood estimator associated to (1.2), where the likelihood is for the ob- 
servations, X, therefore averaging over the unobserved components /3. This 
differs from the solution which is often adopted in the literature, which con- 
sists in maximizing the joint likelihood of X and /3, simultaneously in the 
parameters and in the unobserved variables [Hyvarinen (1999)]. This latter 
method attempts to solve the parametric estimation and hidden variable re- 
construction problems at the same time. However, the estimation of both X 
and (3 is not always a good choice, because it can lead to biased estima- 
tors: as we will show in our experiments, these approaches have good results 
when the noise level is small [as already noticed in Valpola Lappalainen and 
Pajunen (2000)], but these results can significantly degrade otherwise [see 
Section 5, or Allassonniere, Amit and Trouve (2007) for a similar observation 
made in a different context]. In contrast, averaging over the unobserved vari- 
ables takes the whole distribution into account, which becomes important 
as soon as the posterior distribution is not unimodal, with its mean equal 
to its mode. The reconstruction problem (estimating /3 from X), which is 
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also important, for example, to define efficient lossy compression methods, 
can be solved afterward using the estimated parameters. Estimation and 
reconstruction are, in this regard, two separate problems. 

When independent components are modeled as mixtures of Gaussians, as 
done in Moulines, cois Cardoso and Gassiat (1997) for blind source separa- 
tion and blind deconvolution, or with independent factor analysis (IFA), as 
introduced in Attias (1999), maximizing the likelihood of the observations 
(averaging over the nonobserved independent components) can be done us- 
ing the expectation-maximization (EM) algorithm. In this particular case, 
this algorithm can be derived with closed form formulae and explicit compu- 
tations. But, even in this special case (mixture of Gaussians), the EM algo- 
rithm can become computationally prohibitive, especially when the number 
of components is large. For general component distributions, the explicit 
evaluation of conditional expectations given observations constitutes an in- 
feasible task, and only Markov chain Monte Carlo (MCMC) approximations 
remain available. Replacing explicit formulae by Monte Carlo approxima- 
tions in the E-step of the EM algorithm leads to the MCEM algorithm, 
introduced in Moulines, cois Cardoso and Gassiat (1997). MCEM, however, 
still is a highly computational procedure, with many Monte Carlo samples 
required at each update of the parameters. 

We suggest using an alternative approach for maximum likelihood esti- 
mation, relying on a stochastic approximation to the EM algorithm [called 
SAEM, Delyon, Lavielle and Moulines (1999)] which only requires being able 
to sample from this conditional distribution. Instead of running a long Monte 
Carlo simulation at each £'-step, as MCEM does, this algorithm interlaces 
sampling with the M-step, requiring only a single new sample between two 
parameter updates. This algorithm compensates the larger convergence time 
(in number of steps) generally associated to stochastic approximations by 
much simpler iteration steps. This algorithm has been proposed and proved 
convergent under some weak conditions in Allassonniere, Kuhn and Trouve 
(2010). A comparison between the MCEM and SAEM is proposed as part 
of the experiments provided here. 

Another advantage of our learning algorithm is that it applies to many 
different probabilistic distributions. There are almost no restrictions to the 
range of statistical models that can be used for the unobserved independent 
variables. As examples, we will present in this paper different models that 
all fit into this same framework, but which correspond to different statistical 
contexts. They will be introduced in Section 2. The parametric estimation 
method, including the SAEM algorithm, is described in Section 3 and the 
reconstruction of hidden variables is discussed in Section 4. Experimental 
results with both synthetic and real data are presented in Section 5 where 
we also provide some comparison with the EM (when feasible), MCEM and 
FastICA, three of the most used algorithms. 
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2. Models. We start with some general assumptions on the data, that 
win be made specific in the experiments. We assume that the observation is 
a set of vectors which take values in W^. Let Xi, . . . , X„ be the training obser- 
vations, which are assumed to be independent and identically distributed. 
We will denote by X a generic variable having the same distribution as 
theXfc's. The J th coordinate of X (resp., X^) will be denoted X-' (resp., Xj,). 

We assume that X can be generated in the form 

p 
(2.1) X = /io + ^/3^aj+ae, 

where /ig € M , slj G M for all j G {1, . . . ,p}, e is a standard d-dimensional 
Gaussian variable and f3^,...,/3P are p independent scalar variables, the 
distribution of which being specified later. Let (3 denote the p-dimensional 
variable /3 = (/3^, . . . ,/3p). To each observation X^ is therefore associated 
hidden realizations of /3 and e, which will be denoted Pj^ and Sk- 

Denote A = (ai, . . . , a^). It is a d by p matrix and one of the parameters 
of the model. Another parameter is a, which will be a scalar in our case 
(a diagonal matrix being also possible). Additional parameters will appear 
in specific models of /3 which are described in the following subsections. In 
some of these models, it will be convenient to build /3 as a function of new 
hidden variables, which will be denoted Z. 

The models that we describe are all identifiable, with the obvious restric- 
tion that A is identifiable up to a permutation and a sign change of its 
columns (the latter restriction being needed only when the distribution of (3 
is symmetrical). This fact derives from identifiability theorems for factor 
analysis, like Theorem 10.3.1 in Kagan, Linnik and Rao (1973). 

2.1. Logistic distribution (Log-ICA). We start with one of the most pop- 
ular models, in which each (3^ follows a logistic distribution with fixed pa- 
rameter 1/2. The associated cumulative distribution function is P{(3-' <t) = 
l/(l + exp(-2t)). 

For this model, the parameters to estimate are 9 = (A,cr^,/Xo)- Hidden 
variables are Z = /3 and e. This is the model introduced in the original paper 
of Bell and Sejnowsky [Bell and Sejnowski (1995a)], and probably one of the 
most commonly used parametric models for ICA. One reason for this is that 
the logistic probability density function (p.d.f.) is easy to describe, smooth, 
with a shape similar to the Gaussian, but with heavier, exponential, tails. 
Note that, for identifiability reasons, one cannot use Gaussian distributions 
for the components. 

2.2. Laplacian distribution (Lap-ICA). A simple variant is to take /3^ 
to be Laplacian with density e~'^'/2. The parameter still is 6 = (A,(T^,/^o)- 
Hidden variables are Z = /3 and e. 
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The resulting model is very similar to the previous one with similar 
exponential tails, with the noticeable difference that the Laplacian p.d.f. 
it is not differentiable in 0. One consequence of this is that it leads to 
sparse maximum a posteriori reconstruction of the hidden variables (cf. Sec- 
tion 4). 

2.3. Exponentially scaled Gaussian ICA (EG-ICA). In this model, we let 
j3^ = s^Y^ where Y is a standard Gaussian vector, s^, . . . , s^ are independent 
exponential random variables with parameter 1, also independent from Y 
and e. In this case, we can write 

p 
(2.2) X = Ho + ^s^Y^aj + ae. 

j=i 

Hidden variables are Z = (s, Y) and e, and the parameter is ^ = (A,o"^,/^o)- 
The p.d.f. of /3 = sy is given by giP) = J^exp{-^y^ - |) ^. It tends 

to infinity at /? = 0, and has sub exponential tails, because log[-P(/3* > t)] 
is asymptotically proportional to (— t^'^) (see the Appendix for details). It 
therefore allows for higher sparsity and more frequent large values of the 
component coefficients. This may help to overcome the variability in inten- 
sity which appears in medical images for examples. If we think in terms of 
source separation, the source has its own intensity and observations may 
require a large range of intensity around this "mean." It is also important 
to notice that, in spite of its increased complexity, this model can be imple- 
mented and learned as simply as the previous two using the algorithm that 
is proposed here. 

2.4. Independent factor analysis (IFA). The IFA [Attias (1999), Miskin 
and MacKay (2000), Moulines, cois Cardoso and Gassiat (1997)] model is 
a special case of probabilistic ICA in which the distribution of each coordi- 
nate (3^ is assumed to be a mixture of Gaussians. We will here use a restricted 
definition of the IFA model which will be consistent with the other distri- 
butions that we are considering in this paper, ensuring that the /3-^'s are 
independent with identical distribution, and that this distribution is sym- 
metrical. 

More precisely, we will introduce two new sets of hidden variables, the 
first one, denoted (t^, . . . ,t^), represents the class in the mixture model, and 
the second one, denoted (b^, . . . ,bP), is a random sign change for each com- 
ponent. Each t^ takes values in the finite set {0, 1, . . . ,K}, with respective 
probabilities wq,. . . , wk, and b^ takes values ±1 with probability ^. We then 
let 



I3^ = bi^mk5kif) + Y\ 



k=l 
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where Y^ is standard Gaussian. In other terms, /3-^ is a mixture of 2K + 1 
Gaussians with unit variance, the first one being centered, and the following 

ones having means mi, —mi,m2, — w-2, 

The parameters of this model are therefore 6 = {A,a'^,{wk,mk)i<k<K)- 
Hidden variables are Z = (/3, b,t). Note that, even if we use a simplified 
and symmetrized version of the model originally presented in Attias (1999), 
the stochastic approximation learning algorithm that will be designed in 
Section 3.2 immediately extends to the general case where the means depend 
on the index j. 

2.5. Bernoulli- censored Gaussian (BG-ICA). In contrast with the logis- 
tic or Laplacian models for which coefficients vanish with probability zero, 
we now introduce a discrete switch which "turns them off" with positive 
probability. Here, we model the hidden variables as a Gaussian-distributed 
scale factor multiplied by a Bernoulli random variable. We therefore define 
pi = VY^ , using the same definition for Y as in Section 2.3 and letting IP 
have a Bernoulli distribution with parameter a = P{V = 1). We assume that 
all variables 6^, . . . , 6^, y^, . . . , Y^, e are independent. The complete model 
for X has the same structure as before, namely, 

p 
(2.3) l^ = HQ + ^yY^aij+a£. 

Parameters in this case are 9 = (A,o"^,a,/^o) ^'^'^ hidden variables are 
Z = (b,Y) ande. 

Using a censoring distribution in the decomposition is a very simple way 
to enforce sparsity in the resulting model. The population is characterized 
by a set of p vectors, however, each subject is only described by a subset 
of these p vectors corresponding to the active ones. The probability of the 
activation of the vectors is given by a. As a increases, the sparsity in the 
subject decomposition increases as well, whereas the dimension to explain 
the whole training set may remain equal to p. Censored models therefore 
arise naturally in situations where independent components are not expected 
to always contribute to the observed signals. This often occurs in spatial 
statistics, in situations for which observations combine basic components in 
space, not necessarily occurring all together. We will see an example of such 
a situation with handwritten digits where components can be interpreted 
as common parts of some of the digits, but not all, and therefore should 
not be selected every time. Functional magnetic resonance images (fMRIs), 
for which ICA methods have been extensively used [Calhoun et al. (2001), 
Calhoun, Adali and McGinty (2001), Makeig and Jung (1997)], are also 
important examples of similar situations. These three-dimensional images 
indicate active areas in the brain when a subject executes a specific cognitive 
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task. People generally interpret components as basic processing units that 
interact in a complex task, but these units are not expected to be involved 
in every task for every subject. Similarly, genomic data, where a gene can 
activate a protein or not for particular patients, may fall into this context 
as well. 

We now describe some possible variants within the class of censored mod- 
els. 

2.6. Exponentially scaled Bernoulli- censored Gaussian (EBG-ICA). Com- 
bining EG- and BG-ICA, so that a scale factor and a censoring variable 
intervene together, we get a new complete model for X given by 

p 

(2.4) X = /iQ + ^ s^VY^Sij + as. 

i=i 
Since the exponential law has fixed variance, the parameters of interest are 
the same as in the BG-ICA model, that is, Q = (A,cj^,a,/iQ). The hidden 
variables are Z = (s,b, Y) and e. 

2.7. Exponentially- scaled ternary distribution (ET-ICA). The previous 
models include a switch which controls whether the component is present 
in the observation or not. One may want to further qualify this effect as 
"activating" or "inhibiting," which can be done by introducing a discrete 
model for Y, each component taking values —1, or 1. We define (3^ = 
s^Y'J , where s^, . . . , s^ are i.i.d. exponential variables with parameter 1. We 
let 7 = P(Y^ = —1) = P(Y^ = 1), providing a symmetric distribution for 
the components of Y. As before, all hidden variables are assumed to be 
independent. The model is 

p 

(2.5) y. = HQ + Y^ s^Y^Sij + ae. 

i=i 
Hidden variables here are Z = (s,Y) and e, the parameter being 6 = 

(A,o■^7,^lo)• 

The interpretation of the decomposition is that each component has a fixed 
effect, up to scale, which can be positive, negative or null. The model can 
therefore be seen as a variation of the Bernoulli-Gaussian where the ef- 
fect can be a weighted inhibitor as well as a weighted activator. This allows 
selective appearance of decomposition vectors and therefore refines the char- 
acterization of the population. 

This particular model makes all its sense when trying to model the genera- 
tion of data with nonzero mean. Going back to our fMRI example, the mean 
image is more likely to be an active brain since all the patients are subject 
to the same cognitive task and the activation is always positive or zero. This 
will create some active areas in the mean brain (/Xg). However, as we already 
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noticed, these areas can be active or not depending on the subject partici- 
pating to the experiment. This can be modeled by a weighted activation or 
inhibition of its areas around the mean through the corresponding decom- 
position vectors. The decomposition vectors are still expected to correspond 
to the different active zones. This is what this model tries to capture. We 
will see in the experiments that it also applies to the handwritten digits. 

2.8. Single-scale ternary distribution (TE-ICA). The previous model can 
be simplified by assuming that the exponential scale factor is shared by all 
the components, that is, we let (3^ = sY^ , where s is exponential with param- 
eter 1, and Y^ has the same ternary distribution as in the ET-ICA model. 
The decomposition now is 

p 
(2.6) X = /Xo + s^l"^aj+cje. 

i=i 

Hidden variables here are (s, Y), the parameter being = (A,cr^,7,/XQ). 
Notice that this model is not explicitly an ICA decomposition, since the 
components are only independent given the scale factor. Notice also that we 
assume that the scaling effect acts on the components, not on the observation 
noise which remains unchanged. 

Probabilistic-ICA in general is obviously a very efficient representation for 
lossy compression of random variables, since, if the noise is neglected, and 
as soon as the parameters /Xq and A are known, one only needs to know the 
realization of j3 (hopefully with p <^d) to reconstruct an approximation to 
the signal. In the present model, the transmission of f3 only requires sending 
the scalar scale factor, s, and p ternary variables. If many components vanish 
(i.e., if 7 is significantly smaller than 1/2), compression is even more efficient. 

In this model (and for the previous two also), the sparsity of the repre- 
sentation will obviously depend on the number of selected components, p, 
that we suppose given here. When p is too small, it is likely that the model 
will find that censoring does not help and take 7 = 1/2 (or a = 1 in the 
Bernoulli-Gaussian model). Adding more components in the model gener- 
ally results in a and 7 decreasing, enabling some components to be switched 
off. This effect is illustrated in Section 5. 

Finally, let's remark that, although the results in Kagan, Linnik and Rao 
(1973) do not directly apply to this model (the components are not inde- 
pendent, since they share the same scale factor), they can be applied to 
the conditional distribution given the scale to prove identifiability (since the 
scale factor distribution is fixed). 

2.9. Playing with the average. Clearly, all the previous models admit 
a centered submodel in which /Xg = 0, which might be preferred in some 
cases. In this case (/Xq = 0), it may be interesting to allow for some shift in 
the distribution of the components, replacing (3^ by ^ -|- (3^ where /.i is a one- 
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dimensional parameter. This is therefore equivalent to modeling /1q = A/i 
where /i is a p-dimensional vector with all coordinates equal to /i. When 
dealing with scaled, or censored models, one can decide to apply the shift 
before or after censoring or scaling. For example, one can define a shifted 
Bernoulli-Gaussian model by replacing Y^ hy fi + Y^ in Section 2.5, which 
results in shifting (3^ only when it is not censored. 

Another choice that can also be interesting is to model the signal with 
a random, scalar, offset (or AC component). One way to achieve this is to 
impose that one of the columns of the matrix A is the d-dimensional vector 
(1, . . . , 1) . In this case, it is natural to separate the distribution of the offset 
coefficient from the ones of other components, as customary in compression 
(the offset coefficient should not be censored, e.g.). A simple choice is to 
provide it with a logistic or Laplacian distribution. This is illustrated in the 
next model. 

2.10. Single-scale ternary distribution with offset (TEoff-ICA). In this 
model the mean /^q is not a parameter and is not the same for all the observed 
vectors, so that this random effect (in opposition to the fixed effect it had in 
the previous models) now is a hidden variable. We furthermore assume that 
this random variable, denoted /x, takes the form ^ = (//,...,//)€ M"^ where /i 
is Laplacian. So /_f can be interpreted as an offset acting simultaneously on 
all coordinates of X. This yields the following model: 

p 
(2.7) X = /I + s ^ y^aj + ae, 

i=i 

where s follows an exponential distribution with parameter 1 and Y^ are 
ternary variables with 7 = P{Y^ = —1) = P{Y^ = 1). The hidden variables 
are (s, Y,/^) and the parameters {A^a"^,^). 

Introducing observation-dependent offset and scale effects is useful when 
dealing with uncalibrated observations. This is typical, for example, with 
micro-array data, for which strong variations in calibration can occur among 
different patients. This is also common for signal and image processing, for 
which interpretation often needs to be performed in a way which is invariant, 
or robust, to offset or scale effects. 

3. Maximum likelihood estimation. 

3.1. Notation. The previous models are all built using simple generative 
relations Z — )• /3 and (/3, e) — > X. Our goal here is to estimate the parameters 
that maximize the likelihood of the observation of n independent samples 
of X that we will denote x*" = (xi, . . . , x„). 

Let qm{z;9) denote the prior likelihood of the hidden (or missing) vari- 
able Z that generates /3. Denote by qc{:>c\z;9) the conditional distribution 
of X given Z = z which is, in all our models, a Gaussian distribution centered 
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at the ICA decomposition. The joint density is 

q{x, z; 9) = gc(x|z; 9)qm{2; 0) 
and the marginal distribution of X has density 

gobs(x;0)= / qc{x\z-e)qjn{z;9)dz. 

Our goal is to maximize the likelihood of the observations, namely, to find 

n 

(3.1) 9n = argmaxc/*^,(x*"; 9) with <?X(x*"; 0) = U QoU^k^e). 

3.2. SAEM algorithm. This problem can, in principle, be solved using 
the expectation-maximization (EM) algorithm. With the EM, a local max- 
imum of the likelihood is computed recursively while replacing the missing 
variables with a conditional expectation. For each observation x^. and pa- 
rameter 9, we define the conditional density of Z by 

(3.2) i/fc,e(z) = g(z|X = Xfc;e). 

The EM algorithm iterates the following two steps, where t indexes the 
current iteration: 

E: expectation. Compute it+i : 9 ^ it+i{9) = Y^'^=i^Uk^g^[^ogq{xk,Z;9)]. 
M: maximization. Set 9t+i = argmaxg^Q it+i{9) . 

The models we have discussed for ICA belong to the curved exponen- 
tial family, in the sense that the joint distribution of hidden and observed 
variables for a given parameter can be expressed as 

logq{x,z;9) = ^{9)-S{x,z)-logC{9), 

where S is a multidimensional sufficient statistic, (/> is a fixed, vector- valued 
function of the parameters, C is a normalizing constant and the dot refers 
to the usual Euclidean dot product. This implies that 



n 



et+i{9) = ^{9)-lY.E,^^SJ -nlogC{9). 

Thus, the -E-step only requires computing the conditional expectations of 
the sufficient statistic, and the M-step is equivalent to maximum likelihood 
for a fully observed model, with the empirical expectation of the sufficient 
statistic equal to 



1 " 



n 

k=l 



This is an important property (satisfied by our models) for the numerical 
feasibility of the EM algorithm. 
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However, this is not enough, since one must also be able to explicitly 
compute the conditional expectations. For several of our models, there is 
no closed form expression for the densities iy'k,e- For others, like IFA, for 
which such an expression can be derived, its computational complexity is 
exponential in the number of components and rapidly becomes intractable 
(details are given in the Appendix). 

A common way to overcome this difficulty is to approximate these con- 
ditional distributions by Dirac measures at their mode. The resulting algo- 
rithm is sometimes called EM-MAP or FAM-EM (for "Fast Approximation 
with Mode") [Allassonniere, Amit and Trouve (2007), Allassonniere, Kuhn 
and Trouve (2008)]. At each iteration of the algorithm, one computes the 
most likely hidden variables z^, 1 <k <n, with respect to the current pa- 
rameters: 

(3.3) Zf _fc = arg max[log(g(z|X = x^, 9t))] ■ 

z 

The M-step then maximizes the likelihood for the "completed observa- 
tions" X*'" and zt^i, . . . ,zj^.„. 

The statistical accuracy of this approximation is unclear, since it estimates 
a number of parameters that scales like the number of observations. Consis- 
tency of the obtained estimator when n goes to infinity cannot be proved in 
general. Some experimental evidence of asymptotic bias is demonstrated in 
Section 5 below. 

In spite of these remarks, this approach (or approaches similar to it) 
is the most common choice for training probabilistic ICA models [Grimes 
and Rao (2005), Hyvarinen (1999), Olshausen and Field (1996a, 1996b)]. In 
the under-determined problem (p^d), this algorithm has also been imple- 
mented in Bremond, Moulines and Cardoso (1997). 

Although the conditional distribution is not explicit, it is still possible 
(as we shall see later) to sample from it. The conditional expectation of the 
sufficient statistics (Sj+i) can therefore be approximated by Monte Carlo 
simulation, as proposed in Tanner (1996) and Wei and Tanner (1990) with 
the MCEM (Monte Carlo EM) algorithm. The resulting method, however, is 
heavily computational. Also, there is no guarantee that the errors resulting 
from the approximation to the E-step will cancel out to provide an estimator 
converging to a local maximum of the likelihood. 

In this regard, a more interesting procedure, which has been proposed in 
Delyon, Lavielle and Moulines (1999), is a stochastic approximation of the 
EM algorithm, called SAEM. It replaces the E-step by a stochastic approxi- 
mation step for the conditional likelihood (or, in practice, for the conditional 
expectation of the sufficient statistics), on which the M-step is based. More 
precisely, based on a sequence At of positive numbers decreasing to 0, the 
algorithm iterates the following two steps (assuming the tth iteration): 
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SAE step. For k = l,. . . ,n, sample a new hidden variable zt+i^k according 
to the conditional distribution i'k,9t and define 



^t+i(0)=^i(^) + AJ^logg(xfc,Zi+i,fc;0)-^t(^)J. 
M step. Set 



614+1 = argmax£t+i(6'). 
6»ee 

For exponential families, the SAE step is more conveniently (and equiva- 
lently) replaced by an update of the estimation of the conditional expecta- 
tion of the sufficient statistics, namely, 

St+i = Sj + At I - ^ S(xfc, zt+i^k) - St I 

with 

£t+ii9) = (t>{e)-St+i-iogC{e) 

being maximized in the M-step. Note that this algorithm is fundamentally 
distinct from the SEM method [Celeux and Diebolt (1985)] in which the 
^-step directly defines it+i{0) = Ylk=i^^SQ{^k,2t+i,k;d)- 

A final refinement may be needed in the SAEM algorithm, when directly 
sampling from the posterior distribution is infeasible, or inefficient, but can 
be done using Markov Chain Monte Carlo (MCMC) methods. In this situ- 
ation, there exists, for each 6 and x, a transition probability z i-^ Il:x:,d{z, •) 
such that the associated Markov chain is ergodic and has the posterior prob- 
ability (?(-|X = x; 9) as stationary distribution. The corresponding variant of 
the SAEM (which we shall still call SAEM) replaces the direct sampling 
operation 

zt+i,k ^ i^k,et = 9(-|X = Xfc,6't) 
by a single Markov chain step 

Zt+l,fc ~nxj^,6»t(zt,A:r)- 

This procedure has been introduced and proved convergent for bounded 
missing data in Kuhn and Lavielle (2004). This result has been generalized 
to unbounded hidden random variables in Allassonniere, Kuhn and Trouve 
(2010). 

To ensure the convergence of this algorithm in the noncompact case 
(which is our case in the models above), one needs, in principle, to introduce 
a truncation on random boundaries as in Allassonniere, Kuhn and Trouve 
(2010). This would add a new operation between the stochastic approxima- 
tion and the maximization steps, with the following truncation step. Let S be 
the range of the sufficient statistic, S. Let {ICq)g>o be an increasing sequence 
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of compact subsets of S such as Ug>o^'3 ~ '^ ^^^^ ^g *- int(/Cg4.i), Vg > 0. 
Let {6t)t be a decreasing sequence of positive numbers. If St+i wanders out 
of /Ct+i or if \St+i — St\ > 5t, then the algorithm is reinitiahzed in a fixed 
compact set. 

More details can be found in Andrieu, Moulines and Priouret (2005) and 
Allassonniere, Kuhn and Trouve (2010). In practice, however, our algorithms 
work properly without this technical hedge. 

3.3. Application to our models. To complete the description of the SAEM 
algorithm for a given model, it remains to make explicit (i) the specific 
form of the sufficient statistic S; (ii) the corresponding maximum likeli- 
hood estimate for complete observations; and (iii) the transition kernel 
for the MCMC simulation. Formulae for (i) and (ii) are provided in the 
Appendix for the ICA models we have described here. For (iii), we have 
used a Metropolis-Hastings procedure, looping over the components (some- 
times called "Metropolis-Hastings within Gibbs Sampling") that we now 
describe. This is for a fixed observation x^ and parameter 9, although we 
do not let them appear in the notation. So we let i/ = Ukfi be the probability 
that needs to be sampled from. 

In the Metropolis-Hastings procedure, one must first specify a candidate 
transition probability p(z,z). A Markov chain (Zt,t = 0, 1, . . .) can then be 
defined by the two iteration steps, given Z^: 

(1) Sample z from /?(Zj, •). 



(2) Compute the ratio 



ir, N J^(z)/9(z,Zt) 
r[Zt,z.) - 



and set Z^+i = z with probability min(l,r) and "Lt+i = T^t otherwise. 

An interesting special case is when p corresponds to a Gibbs sampling proce- 
dure for the prior distribution, Qmi'^', &)■ Given the current simulation z, one 
randomly selects one component z^ and generates z by only changing z^ , re- 
placing it by z^ sampled from the conditional distribution qm{z^z'^,i ^ j;9). 
In this case, it is easy to see that the ratio r is then given by 

., . g(xfc|z) 
r{z,z) = - — — . 
9(xfc|z) 

The Markov kernel is then built by successively applying the previous kernel 
to each component. 

Our implementation follows this procedure whenever the current set of 
parameters leads to an irreducible transition probability p. This is always 
true, except for the censored models, in which parameters a € {0, 1} or 
7 G {0, ^} are degenerate and must be replaced by some fixed values ao 
and 7o in the definition of p. 
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4. Reconstruction. Assuming that tlie parameters in the model are known 
or have been estimated, the reconstruction problem consists in estimating 
the hidden coefficients of the independent components, /3 £ M^, based on 
a new observation of x S M . As noticed in the Introduction, this is a sepa- 
rate problem. Estimating model parameters is based on the likelihood of the 
observation, which integrates out hidden variables. In contrast, reconstruct- 
ing hidden decomposition vectors from data is typically done by minimizing 
a chosen loss function for a fixed choice of model parameters, and is based 
on the posterior likelihood (proportional to the complete likelihood). Even if 
this does not constitute our main focus here, we briefly describe in this sec- 
tion how the MAP estimator, based on maximizing the complete likelihood, 
can be achieved using the models presented in this paper. 

Reconstruction with probabilistic ICA models is not as straightforward as 
with complete ICA, for which the operation reduces to solving a linear sys- 
tem. A natural approach is maximum likelihood, that is, (with our notation) 
find z = argmaxz (j){9) ■ 5(x,z) and deduce (3 from it. 

This maximization is not explicit, although simpler for our first two mod- 
els. Indeed, for Log-ICA, this requires minimizing 

^|x-A/3p + 2X:iog(e/^^+e-/^^). 
i=i 

(We take /ig = in this section, replacing, if needed, x by x — /Xg.) 
The Laplacian case, Lap-ICA, gives 

^|x-A/3p + ^|/3^|. 
i=i 

Both cases can be solved efficiently by convex programming. The Lapla- 
cian case is similar (up to the absence of normalization of the columns of A) 
to the Lasso regression algorithm [Tibshirani (1996)], and can be minimized 
using an incremental procedure on the set of vanishing /3-^'s [Efron et al. 
(2004)]. 

The other models also involve some form of quadratic integer program- 
ming, the general solution of which being NP-complete. When dealing with 
large numbers of components, one must use generally suboptimal optimiza- 
tion strategies (including local searches) that have been developed for this 
context [see Li and Sun (2006), e.g.]. 

The EG-ICA problem requires minimizing 

p 2 p p 



2(72 









with s^, . . . , s^ > 0. This is not convex, but one can use in this context an 
alternate minimization procedure, minimizing in y with fixed s and in s 
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with fixed y. The first problem is a straightforward least squares and the 
second requires quadratic programming. 

The symmetrized IFA model leads to minimize 

1 1 ^ ^ 

^|x - A/3|2 + - Y,{P' - ^rn^^f + J^logu;,, 

i=i i=i 

with respect to /3, the unobserved configuration of labels t, and the sign 
change b. When labels and signs are given, the problem is quadratic in /3. 
Given (3 and t, the optimal b is explicit, and for fixed /3 and b, the search 
for an optimal t reduces to a quadratic integer programming problem. For 
small dimensions, it is possible to make an exhaustive search of all {2K + 1)^ 
possible configurations of labels and signs. 
For the BG-ICA, we must minimize 



1 
2^ 



V b>v^a 



j=i 



3 = 1 j=l 



f^)' 



with p = log((l — a)/a) and V G {0, 1}. The minimization in b is a (0, 1)- 
quadratic programming problem, an exhaustive search being feasible for 
small p. Given b, the optimal y is provided by least squares. 
Concerning the EBG-ICA, we must minimize 



1 

2^ 



X 



^s^Vy^aj 



+ 



p 

E 



s^ +P 



T.^ + \ 



p 



(y'-f^f 



with p = log((l — a)/a), s^,. . . ,8^ > and b^ G {0, 1}. This is again a (0, 1)- 
quadratic programming problem in b and, given b, the optimal y and s are 
computed similarly to the EG-ICA model. 
With ET-ICA, the objective function is 

2 
1 



2a2 



X 



p 



s-^y-'sij 



J^s^+p 



j=i 



p 

E 



r 



with /9 = log((l- 27)727), 2/\...,yPG {-1,0,1} and s\...,sP>0. This is 
a quadratic integer programming in y, with a complexity of 3^ for an ex- 
haustive search. Given y, computing s is a standard quadratic programming 
problem. 

The TE-ICA problem, requiring to minimize 

n 2 



1 

2^ 



X 



'E 



y^a.j 



p 



is slightly simpler since, in this case, the computation of s > given y is 
straightforward. 
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The TEoff-ICA model involves a third hidden variable /i. This leads to 
the following objective function to minimize both in s, y and fi: 

2 
1 



2(j2 



x-Ai-s^y^aj 






with /I = (/I, . . . , /i) G M , and s > 0. Given /i, the minimization with respect 
to s and y is done as in the previous TE-ICA model. The minimization 
over /i has a closed form: 



i=i \ 1=1 / 



5. Experiments. 

5.1. Synthetic image data. 

5.1.1. Data set. We first provide an experimental analysis using syn- 
thetic data, which allows us to work in a controlled environment with a known 
ground truth. In this setting, we assume that the true distribution is the 
Bernoulli-Gaussian (BG) model, with two components (p = 2). The prob- 
ability a of each component to be "on" is set to 0.8. We run experiments 
based on 30, 50 or 100 observations, and vary the standard deviation of the 
noise using a = 0.1, 0.5, 0.8, 1.5. 

The components are represented as two-dimensional binary images (grey 
levels being either or 1). The first one is a black image (grey level equals 0) 
with a white cross (grey level 1) in the top left corner. The second one has 
a white square (same grey level) in the bottom right corner. These two 
images are shown in Figure 1. Figure 2 presents 30 images sampled from 
this model with the different noise levels. The training sets were sampled 
once and used in all the comparative experiments below. We used a fixed 
color map for all figures to allow for comparisons across experiments (this 
explains why the patterns in Figure 1 appear as grey instead of white). 

5.1.2. Interpretation of the results. We have compared the following es- 
timation strategies: (1) FAM-EM algorithm [Grimes and Rao (2005), 01- 
shausen and Field (1996a), Tenenbaum and Freeman (2002)] (which max- 
imizes the likelihood with respect to parameters and hidden variables to- 
gether) with the Log-ICA model (Logistic distribution); (2) SAEM with 



Fig. 1. Two decomposition images used for sampling synthetic data. 
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Fig. 2. Samples of the training sets used for synthetic data with different level of noise, 
a = 0.1, 0.5,0.8, 1.5 are upper left, upper right, lower left and lower right, respectively. 

the same Log-ICA model; (3) SAEM for the IFA model, and (4) EM with 
the IFA model [Attias (1999), Welling and Weber (2001)]; (5) SAEM for 
the true BG-ICA model; (6) finally, we also ran a standard ICA decom- 
position using fast-ICA [Hyvarinen and Oja (1997)] with a requirement of 
computing only two components (with a preliminary dimension reduction 
based on PC A). Models (3) and (4) are theoretically equivalent, and our 
experiments evaluate how they differ numerically. We reemphasize that the 
EM algorithm for the IFA model is only feasible for a reasonably small 
number of components, p, and number of mixtures, K (with a complex- 
ity in RP), whereas this limitation does not apply to the SAEM algorithm 
(see the Appendix for more details). For other alternative approaches to 
the EM for the IFA model (including the use of the FAM-EM strategy), 
see Brandt Petersen and Winther (2005), Come et al. (2008), Grimes, Shon 
and Rao (2003), Valpola Lappalainen and Pajunen (2000), Varoquaux et al. 
(2010). The fast-ICA algorithm used in (6) is nonparametric (and maximizes 
an approximation of the negentropy of the model) . 

We also notice that (1), which requires minimizing in A,a'^ and b, is 
ill-posed because a transformation {A,h) — ?> {XA,h/X) always decreases the 
likelihood when A > 1, which implies that the optimal A is unbounded. To 
address this, one solution is to use a prior distribution for A, or enforce some 
normalization. We chose the latter option, enforcing the empirical mean 
square of all b's to be equal to log 2 as implied by the logistic distribution. 

Table 1 provides mean-square errors for the estimation of A based on 
different models and algorithms, and for different noise levels and sample 
size. Each error is computed from 50 repeats of the full experiment (sampling 
from the true model followed by estimation). The mean square error (MSE) 
is defined by 

MSE = -^ ^[(Aest(x, 1) - Atruc(a;, 1))' + (^cst(x,2) - Arue(x,2))2], 
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Table 1 

Mean-square estimation error for various combinations of algorithms and models based 

on estimations based on 30/100 samples and several noise levels. Each mean-square error 

IS an average over 50 independent repeats 





30 


images per training 


set 


100 


images per training 


set 


Algo/model 


o- = 0.1 


o- = 0.5 


o- = 0.8 


(T= 1.5 


o- = 0.1 


o- = 0.5 


o- = 0.8 


CT= 1.5 


FAM-EM/Log 


0.55 


0.49 


0.51 


0.82 


0.52 


0.47 


0.46 


0.62 


SAEM/Log 


0.05 


0.06 


0.10 


0.26 


0.03 


0.06 


0.06 


0.11 


SAEM/IFA 


0.19 


0.18 


0.16 


0.20 


0.16 


0.16 


0.09 


0.10 


EM/IFA 


0.05 


0.04 


0.06 


0.15 


0.05 


0.03 


0.03 


0.06 


SAEM/BG 


0.09 


0.13 


0.16 


0.6 


0.07 


0.07 


0.05 


0.25 



where A is the grid of pixels, | A| its cardinaHty, ^est is the estimated decom- 
position matrix and ^truc the true one (up to a permutation and a change 
of sign) . The lack of monotonicity in mean squared errors with respect to a 
may come from the small number of simulations which are averaged here. 
The estimation is to proceed 50 times but a larger number of simulations 
would solve the problem. 

We also evaluated the accuracy of the estimation of o"^. The results are 
presented in Table 2. A surprising result is that a^ is always well estimated 
even when the decomposition vectors are not. This is an important obser- 

Table 2 

Estimated noise variance with the different models and the two different algorithms for 

30, 50 and 100 images m the training set. These variances correspond to the estimated 

decomposition vectors presented in Figure 3 









Algo/model 








True o-^ 


FAM-EM/Log SAEM/Log EM/IFA SAEM/IFA SAEM/BG 


30 images in the 


0.001 


0.0088 


0.0086 


0.0097 


0.0089 


0.0087 


training set 


0.2500 


0.2253 


0.2224 


0.2240 


0.2410 


0.2226 




0.6400 


0.5685 


0.5577 


0.5534 


0.6092 


0.5569 




2.2500 


2.0375 


1.9978 


2.1199 


2.0735 


2.0009 


50 images in the 


0.001 


0.0095 


0.0092 


0.0095 


0.0094 


0.0092 


training set 


0.2500 


0.2400 


0.2399 


0.2363 


0.2524 


0.2399 




0.6400 


0.5831 


0.5798 


0.6381 


0.6429 


0.5795 




2.2500 


2.1544 


2.1377 


2.2061 


2.2112 


2.1366 


100 images in the 


0.001 


0.0176 


0.0097 


0.0095 


0.0098 


0.0097 


training set 


0.2500 


0.2432 


0.2459 


0.2455 


0.2564 


0.2456 




0.6400 


0.6225 


0.6282 


0.6336 


0.6388 


0.6280 




2.2500 


2.1268 


2.1479 


2.1767 


2.1970 


2.1490 
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Algo/Modcl 

FAMEM/Log 

SAEM/Log 

SAEM/IFA 

EM/IFA 

SAEM/BG 

FastICA 

Algo/Model 

FAMEM/Log 

SAEM/Log 

SAEM/IFA 

EM/IFA 

SAEM/BG 

FastICA 



30 images per training set 50 images per training set 

a = 0:i a = 0.5 (7 = 0.8 ct = 1,5 ct = 0.1 ct = 0.5 <t = 0.8 ct = 1,5 

BB UK B^ i^^ BB BB BB ^^ 



100 images per training set 
a = 0.1 a = 0.5 a = 0.8 a = 1.5 

BB BB BB Sf» 



Fig. 3. Estimated decomposition images with different models and algorithm. The esti- 
mation becomes less and less satisfactory as the noise variance increases. It is even more 
pregnant for the FAMEM and FastICA algorithms for which increasing the number of ob- 
servation does not address this problem. The other models estimated using our algorithm 
provide similar results and the noise does not drastically affect the estimation. 



vation which indicates that one should not evaluate the final convergence of 
any of these algorithms based on the convergence of a'^ only. 

A visual illustration of these results is provided in Figure 3, in which 
a single (typical) experiment is displayed for each noise level and sample size. 
The algorithms that maximize the likelihood of the observed data (SAEM, 
MCEM and EM for the IFA) all provide results that are consistent with the 
ground truth, even when the model used for the estimation differs from the 
true one. This statement does not apply to the FAM-EM algorithm (which 
maximizes the likelihood with respect to parameters and hidden variables 
together), or to FastICA, which degrade significantly when the noise is high. 
We also experienced numerical failures when running the publicly available 
software with high noise (we had, in fact, to resample a new 100-image 
training set to be able to present results from this method). 

Since these two algorithms both rely on Monte Carlo sampling, we have 
compared the performances of our SAEM with a Log-ICA model and of 
the Monte Carlo (MC) EM algorithm. The expectation step of the EM is 
replaced by an approximation of the expected value of the sufficient statis- 
tics using a Monte Carlo sum. Therefore, at each iteration of the algorithm, 
MCEM requires repeated samples from the posterior distribution of the hid- 
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100 training images 



Algo/Model 

SAEM/Log 
MCEM w/ 10 samp. /Log 
MCEM w/ 30 samp. /Log 



a = 0.1 



o- = 0.5 



(7 = 0.8 



a = 1.5 



MSE / Time cost in seconds 



SAEM/Log 
MCEM w/ 10 samp./Log 
MCEM w/ 30 samp./Log 



0.0926 / 13.51 
0.1558 / 124.01 
0.2189 / 369.65 



0.0515 / 13.43 
0.0343 / 125.82 
0.0643 / 370.65 



0.0646 / 13.29 
0.0511 / 125.13 
0.0339 / 370.15 



0.0459 / 14.17 
0.0847 / 125.18 
0.0614 / 372.65 



Fig. 4. Comparison between MCEM and SAEM with the Log-ICA model. The top images 
show the decomposition vectors estimated with either model and for different numbers of 
Monte Carlo samples used to approximate the expectation in the MCEM. The table presents 
the mean square error (MSE) and the time cost of each estimation. The results look very 
similar (except for low noise variance where the MCEM seems to behave like the FAM-EM), 
while the time cost of the MCEM increases linearly in the sample size. 



den variables given the observations. Larger samples yield a better approx- 
imation and generally result in fewer EM iterations to achieve convergence. 
Of course, this also implies a computational cost per iteration which grows 
linearly in the sample size. Notice also that we cannot generate independent 
samples from the posterior distribution, but only Markov chain samples re- 
sulting from the MCMC sampler described in Section 3.3. These samples are 
therefore correlated and only asymptotically sample from the posterior dis- 
tribution. A comparison of the output of this algorithm and of the proposed 
(MCMC-)SAEM is displayed in Figure 4. We ran 1,000 iterations, using 10 
and 30 samples in each Monte Carlo approximation, and the estimation is 
based on 100 observations. The results are similar, whereas the time cost is 
about the number of samples (10 or 30) times longer for the MCEM than 
for SAEM. Decreasing the number of samples accelerates the estimation but 
degrades the estimations, in particular, when the noise level is high. 

We have also made a broad comparison of the required computation time 
associated to each algorithm. One must remember, when interpreting these 
results, that each algorithm optimizes its own objective function, and only 
the ones of the EM/IFA and SAEM/IFA coincide. While the objective func- 
tions of the SAEM models can be considered as similar, the one associated 
with the FAM-EM is quite different, and the comparison must be done with 
this in mind. 

Another difficulty in computing these numbers is that the true solution 
(maximum likelihood, or mode) is unknown, and even if it were known, all 
methods are prone to converge to a local maximum and never get close to it. 
Because of this, we have used an empirical definition of the convergence time 
as the first time at which the maximal subsequent variation of the current 
solution is less than 1/1,000 of what it was initially. More precisely, if A{t) 
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Table 3 
Comparison of computation costs. Second column: average time, in seconds, for 1, 000 
iterations of each algorithm. Columns 3 to 6: average number of iterations to achieve 

convergence 





Time for 1,000 iterations 


Number of iterations to convergence 


Algo/model 


o- = 0.1 


cr = 0.5 


o- = 0.8 


cr = 1.5 


FAM-EM/Log 


206 s 


3,700 


3,000 


1,300 


600 


MCEM/Log 


140 s 


(4,800) 


(4,900) 


(5,000) 


(5,000) 


SAEM/Log 


14 s 


230 


980 


1,720 


2,390 


EM/IFA 


1,600 s 


4,400 


350 


140 


50 


SAEM/IFA 


33 s 


1,510 


2,240 


2,920 


3,670 


SAEM/BG 


26 s 


530 


1,210 


1,900 


3,660 



is the estimated component matrix at step f, and d{t) = maxj/>t|A(t') — 
^(^max)li the convergence time defined as 

(5.1) tconv = min{t : d{t) < d(l)/l,000} 

(imax being the maximal number of iterations, equal to 5,000 in our experi- 
ments) . 

These results are summarized in Table 3. As expected, the times per 
iteration of the SAEM-based methods are much smaller than with other 
approaches. This is only partially compensated by an increased number of 
iterations in order to achieve convergence. Note that, in this table, the num- 
ber of steps to convergence for the MCEM is close to imax = 5,000, which 
indicates that (5.1) has not been satisfied before the maximal number of 
iterations. Note also that the studied model, with two independent compo- 
nents, is the most favorable for the EM/IFA algorithm, which would become 
intractable with a higher number of components. 

Another interesting (and difficult to explain) observation from this table 
is that the deterministic algorithms (FAM-EM and EM for the IFA) seem 
to require fewer iterations at high noise level, while the trend is opposite 
for the stochastic methods. A final remark is that the fastICA algorithm is 
much faster than any of these methods when run after reducing the model 
dimension using PCA. 

5.2. Effect of the number of estimated com,ponents. We now illustrate, 
with a different model, how, for censored models, the estimation of the cen- 
soring coefficient evolves with the number of components. In this experiment 
we have generated 1,000 samples of a shifted Bernoulli-Gaussian model (see 
Section 2.9) with 8 components (the components being represented as in- 
dicators of 8 nonoverlapping intervals). The true value of a is 0.5, and we 
took yU = 2. In Figure 7 we plot the value of the estimated a as a function 
of the number of components in the model, p. We can see that this value 
seems to decrease to zero, at a rate which is, however, not linear in 1/p. 
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Fig. 5. Estimated components with probabilistic ICA. The sample contains 1,000 signals 
generated by a shifted Bernoulli-Gaussian model (see Section 2.9) with 8 components (the 
components being represented as indicators of 8 nonoverlapping intervals). The true value 
of a is 0.5, and we took /i = 2. Left: components estimated with p — 6. Right: components 
estimated with p = 8. When estimating only 6 components, two sources appear in the same 
component which will make them always appear together with the same weight. This is 
what can be seen pointed by the arrows. However, when estimating 8 components, the 8 
sources are recovered. 



The expected number of nonzero components grows from 2 for p = 2, to 8 
when p = 8 (correct value — pointed in red in the plot), to about 10 when 
p = 50. The estimated components for p = 6, 8 and 15 are plotted in Fig- 
ures 5 and 6. This illustrates the effect of under-dimensioning the model, in 
which some of the estimated components must share some of the features 
of several true components (pointed by arrows), and of over-dimensioning, 
in which some of the estimates components are essentially noise (clearly 
indicating overfitting of the data — in red rectangles), while some other esti- 
mated components, which correspond to true ones, are essentially repeated 
(twice for the components marked with red and green crosses). Components 
are correctly estimated when the estimated model coincides with the true 
model {p = 8). 

Although we are not addressing the estimation of the number of compo- 
nents in this paper, these results clearly indicate that this issue is important. 
We refer the reader to standard approaches in this context, based on penaliz- 
ing model complexity, using, for example, the Akaike Information Criterion 
(AIC) [Akaike (2003)] or the Bayesian Information Criterion (BIC) [Maugis, 
Celeux and Martin-Magniette (2009), Schwarz (1978)]. Using a Bayes prior 
on parameters would be possible, too, with a straightforward adaptation of 
the SAEM algorithm. 

5.3. Handwritten digits. We now test our algorithms on some 2D real 
images. The first training set we use is the USPS database, which contains 
7,291 grey-level images of size 16 x 16. We used the whole database as a train- 
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Fig. 6. Estimated components with probabilistic ICA. The sam,ple contains 1,000 signals 
generated by a shifted Bernoulli-Gaussian model (see Section 2.9) with 8 components (the 
components being represented as indicators of 8 nonoverlapping intervals). The true value 
of a is 0.5, and we took ^ = 2. Components estimated with p = 15. We can see that 15 is 
too many since among the 15 sources found, we can recognize some noise (squared by red 
rectangles) and repeated components (red crosses and green crosses are similar with each 
other). 

ing set and computed 20 decomposition vectors. Some images from this data 
set are presented in Figure 8(left). 

The different decomposition vectors and the estimated means (when it is 
a parameter) are presented in Figure 9. Each 10 by 2 set of 20 images on 
the right column corresponds to one run of the algorithm for a given model 
(selecting the most representative results). 

Interestingly, the results highlight the advantage of the censored models 
compared to the continuous ones in such situations. Modeling component 




50 p 



Fig. 7. Estimated component activation probability (a) as a function of the model size for 
a Bernoulli Gaussian model estimated on the 1,000 signals of a shifted Bernoulli-Gaussian 
model. Ground truth is p — 8 and a — 0.5 (red point). 
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Fig. 8. 100 images randomly extracted from the USPS database (left) and from the face 
category in the CaltechlOl data set (right). 

coefficients tliat can vanisli with positive probability (such in BG and ET- 
ICA) enables to have decompositions which do not involve vectors shared 
by all the training sample. Considering a data set such as USPS, one im- 
age in one class is not easily expressed as a mixture of images from other 
classes. Therefore, it is not appropriate to express it as a linear combination 
of all the decomposition vectors with nonzero coefficients. This means that 
we expect the decomposition vectors to be separated digits and appearing 
only for samples belonging to the corresponding class. This is what we see 
with the censored models (Figure 9, lines 2 to 4), many decomposition vec- 
tors represent well-formed digits, whereas decomposition vectors for other 
models (Figure 9, line 1) mix several digits more often to be able to cancel 

° " BBHHSaiilRlHHH 



■ 71 



Fig. 9. Results of the independent component estimation on the USPS database using 
four selected models. The training set is composed of 7,291 images containing the 10 digits 
randomly spread. Left column: mean image ^o ■ Right column: 20 estimated decomposition 
vectors. (See Figure 1 in the supplementary file [Allassonniere and Younes (2011)] for 
a larger image.) 



26 S. ALLASSONNIERE AND L. YOUNES 

the nonexpected features. These binary or ternary models seem to be very 
adequate in such situations. 

Note that the USPS data set does not have the same number of images 
of each digit. There are about twice as many O's or I's as other digits. This 
fact explains the "bias" one can see on the mean, on which the shape of the 
zero is noticeable. In all experiments, the trace of each digit can be (more or 
less easily) detected in at least one of the components, at the exception of 
digit 2. This is probably due to the large geometrical variability of the 2's, 
which is much higher than other digits (changes of topology-loop or not, 
changes in global shape) and therefore difficult to capture. 

5.4. Face images. We have run a similar experiment on a data set of face 
images (extracted from the CaltechlOl data set). Each of these images has 
been decomposed into patches of size 13 x 13, with some of them presented 
in Figure 8(right). The resulting database contains 499,697 small images and 
we estimated 20 decomposition vectors. Results are presented in Figure 10. 
The patterns which emerge from the estimations are quite similar from one 
model to another: vertical, horizontal and diagonal separation of the image 
into black and white, blobs, regular texture like a regular mesh, etc. 

We also ran the same estimation with two of the previous models looking 
for 100 decomposition vectors. The results are presented in Figure 11. We 
selected the Log and BG-ICA since one has a continuous density and the 
second has a semi-discrete one. The results are rather different. While the 
Log-ICA model tends to capture some textures, the BG-ICA captures some 
shapes. In this example, as well as with the digit case, the sparsity of the 
decomposition makes sense and plays an important role. This database is 
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Fig. 10. Decomposition vectors from six selected models. From left to right and top to 
bottom: Log-ICA, Lap-ICA, EG-ICA, BC-ICA, EBG-ICA, ET-ICA, TE-ICA, TEoff-ICA. 
For each model the top row is the mean image and the bottom rows are the 20 corresponding 
decomposition vectors. (See Figure 2 m the supplementary file [Allassonniere and Younes 
(2011)] for a larger image.) 
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Fig. 11. 100 decomposition vectors from 2 m,odels. Left: Log-ICA. Right: BG-ICA. (See 
Figure 3 in the supplementary file [ Alias sonniere and Younes (2011)] for a larger image.) 

composed of discrete features which can hardly be approximated by a hnear 
combination of continuous patterns. The models generating sparse represen- 
tations again seems to be better adapted to this kind of data. 

5.5. Anatomical surfaces. We finally consider a data set containing a fam- 
ily of 101 hippocampus surfaces that have been registered to a fixed template 
using Large Deformation Diffeomorphic Metric Mapping [Miller, Trouve and 
Younes (2002, 2006), Trouve (1998), Trouve and Younes (2002)]. We here 
analyze the logarithm of the Jacobian determinant of the estimated defor- 
mations, represented (for each image) as a scalar field over the surface of the 
template, described by a triangulated mesh. These vectors have fixed length 
{d = 3,223) equal to the number of vertices in the triangulation. 

The 101 subjects in the data set are separated in 3 groups with 57, 32 and 
12 patients, containing healthy patients in the first group and patients with 
Alzheimer's disease and semantic dementia (denoted the AD group later) at 
different stages in the last two groups. 

Using our algorithm, we have computed p = 5 decomposition vectors 
based on the complete data set. Figures 13 to 15 present these decomposi- 
tion vectors mapped on the meshed hippocampus for six selected models. 
The estimated mean is shown on the left side and the five corresponding 
decomposition vectors are on the right-hand side. Images are presented with 
different color maps to facilitate the visualization of the patterns. In partic- 
ular, even if the means seem to contain a lot of information, their intensities 
vary on a very small scale compared to all the decomposition vectors (they 
are actually close to 0). 

Although results vary with the chosen model, we can see common features 
emerging. First of all, the means are very similar to each other. The patterns 
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Table 4 

Mean and standard deviation of the p-values for the eight models with the five 

decomposition vectors shown in Figures 13 to 15. The mean and the standard deviation 

are computed over 50 samples of the posterior distributions of the hidden variables to 

separate the first group (Control) with respect to the two others (AZ) 



Model 


Log-ICA Lap-ICA 


EG-ICA 


BG-ICA 


EBG-ICA 


Mean on log lO^'^x 

Std deviation on log 10~^ x 


0.31 0.29 
0.16 0.19 


0.27 
0.12 


0.33 
0.25 


0.9 
1.2 


Model 


ET-ICA 


TE-ICA 




TEoff-ICA 


Mean on log 10~^x 

Std deviation on log 10~^ x 


0.27 
0.14 


2.4 
2.9 




75.7 
126.2 



which we can notice on each of them is the same. For example, there is 
a noticeable contraction on the top part and an extension on the bottom left- 
hand side of the shape. These deformations, however, have a small amplitude 
and can be interpreted as the "bias" of the training set with respect to the 
template. Concerning the decomposition vectors themselves, the pattern of 
the first vector of the Logistic model is present in all other models [e.g., in 
position 1 for the Laplacian, EG, TE and TEoff models (not shown here), 4 
for the BG model, 5 for the EBG (not shown here) and 2 for the ET model]. 
Other patterns occur also, like a contraction or a growth of the tail part [in 
vector 3 of Log, Lap, EG, BG, EBG, TEoff (not shown here) and 5 of TE] 
or on the bottom of the left part of the image [in vectors 4 and 5 of Log, 
5 of Lap, EG, BG and TEoff (not shown here) and in vector 1 otherwise]. 
These common features seem to be characteristic of this population. 

Even if a careful justification of the following statements would require 
a more thorough study, which would fall out of the scope of the present 
paper, these ICA patterns seem to correlate with anatomical hippocampus 
regions, such as those introduced in Miller et al. (2009) and Wang et al. 
(2006), in the sense that the supports of the decomposition vectors are lo- 
cated within subregions of the anatomical segmentation. For example, the 
first and third components from the log-ICA decomposition significantly 
overlap with what authors in Wang et al. (2006) refer to as the hippocam- 
pus lateral zone, while components 3 and 5 are contained in the superior 
zone, and component 2 in the interior-medial zone. Similar conclusions ap- 
ply with most decomposition vectors obtained with other ICA methods. 

In Tables 4 and 5 we provide the p-values obtained from the compari- 
son of the five ICA coefficients (/?) among the three subgroups. The test 
is based on a Hoteling T-statistic evaluated on the coefficients, the p-value 
being computed using permutation sampling. The test is performed for two 
different comparisons: first we compare the healthy group with respect to 
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Table 5 

Mean and standard deviation of the p-values for the eight models with the five 

decomposition vectors shown in Figures 13 to 15. The mean and the standard deviation 

are computed over 50 samples of the posterior distributions of the hidden variables to 

separate the first group (Control) with respect to the second one (mild AZ) 



Model 


Log-ICA Lap-ICA 


EG-ICA 


BG-ICA 


EBG-ICA 


Mean on log lO^'^x 

Std deviation on log 10~^ x 


9.0 9.6 
3.8 4.8 


8.3 
2.7 


10.9 
7.6 


18.7 
17.7 


Model 


ET-ICA 


TE-ICA 




TEoff-ICA 


Mean on log 10~^ x 

Std deviation on loglO^^ x 


8.9 
4.6 


30.8 

28.8 




148.7 
160.4 



the two pathological groups. This is what is shown in Table 4. The second 
test compares the healthy group with the group of 32 mild AD patients. The 
results are presented in Table 5. 

Because SAEM is stochastic and only expected to converge to a critical 
point of the likelihood (which may not be unique), different runs of the 
algorithm starting from the same initial point can lead to different limits. 
To evaluate the effect of this variability, we ran the algorithm for each model 
50 times, with the same initial conditions, and computed an average and 
a standard deviation of the p-values. 

The results are mostly significant. Indeed, almost all methods yield p- 
values under 1% when we compare the control population to the AD groups 
and less than 3% for the comparison of the control versus mild AD. 

The only model which does not yield significant p-values is the offset 
case. Both the mean and standard deviation are high (even higher when we 
focus on the mild AD population). This suggests that this model on this 
database is unstable. One run can lead to significant decomposition vectors 
and a second one can lead to very different results. This particular model, 
which worked well with the USPS database, for example, does not seem to 
be adapted to this type of data that is considered here. The mean is very 
close to zero and is therefore not a relevant variable for this application. 
The additional variability in the model may have an adverse effect on the 
estimation. In cases where the dimension of the data is much larger than the 
number of samples in the training set, it is natural to think that adding more 
variability in the estimation process may lead to unstable results and there- 
fore large variance of estimated parameters. Depending on this paradigm, 
the user may prefer to reduce the number of random variables to the de- 
composition vector weights only. 

Figure 12 provides some insight in the way components are turned on/off 
by the ET-ICA model, by plotting the estimated probability, 7 = PiXl ~ 
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Fig. 12. Evolution of the probability of one component to activate or inhibit the corre- 
sponding decomposition vector in the ET-model with respect to the number of decomposition 
vectors. The training set is the set of 101 hippocampi. 



— 1) = P(Xk ~ l); against the number of decomposition vectors, p. As al- 
ready noticed in Section 5.2, for small p, all components are needed, yielding 
7 ~ 1/2. When more components are added, they do not need to appear all 
the time, yielding a decreasing value of 7. 

6. Conclusion and discussion. This paper presents a new solution for 
probabilistic independent component analysis. Probabilistic ICA enables to 
estimate a small number of features (compared to the dimension of the data) 
which characterize a data set. Compared to plain ICA, this avoids the insta- 
bility of the computation of the decomposition matrix when the number of 
observations is much smaller than their dimension. We have demonstrated 
that the stochastic approximation EM algorithm is an efficient and pow- 
erful tool which provides a convergent method that estimates the decom- 
position matrix. We have shown that this procedure does not restrict the 
large choice of distributions for the independent components, as illustrated 
by eight models with different properties, mixing continuous and discrete 
probability measures, that we have introduced and studied. 

Future works will be devoted to the analysis of nonlinear generative mod- 
els that allow for the analysis of data on Riemannian manifolds, including 
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Fig. 13. Left: mean (left) and 5 decomposition vectors estimated with the Log-ICA model. 
Right: mean (left) and 5 decomposition vectors estimated with the Lap-ICA model. Each 
image has its own color map to highlight the major patterns. 
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Fig. 14. ie/i; mean (left) and 5 decomposition vectors estim,ated with the EG-ICA model. 
Right: mean (left) and 5 decomposition vectors estimated with the BG-ICA model. Each 
image has its own color map to highlight the major patterns. 

the important case of shape spaces m which the models generate nonhnear 
deformation of given templates. Generalizations of the methods proposed in 
Allassonniere, Amit and Trouve (2007) and Allassonniere and Kuhn (2010) 
will be developed, in order to estimate both the templates and the generative 
parameters. 

APPENDIX A: PROOF OF THE SUBEXPONENTIAL TAIL OF THE 

EG-DISTRIBUTION 

Let {Y, S) be a pair of independent random variables where Y and S have 
a standard normal distribution and an exponential distribution, respectively. 
Let /3 = YS and assume t > so that /3 > t implies y > 0. We have [letting 

C7 = (27r)-i/2] 

P(/3 > t)=¥{s > t/y,y> 0)=cj f(s>-] exp(-^yA dy 
= C I expf --y^ - -]dy. 
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Fig. 15. Left: mean (left) and 5 decomposition vectors estimated with the ET-IGA model. 
Right: mean (left) and 5 decomposition vectors estimated with the TE-IGA model. Each 
image has its own color map to highlight the major patterns. 
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Let ht{y) = \y^ + ^. We can write, letting z = yjt^l'^ , 

hAy) = ^^^l^^t^l^a{z) 

with a(z) = ((z — 1)^ + z~^ + z — 2)/2. Making the change of variables y ^ z 
in the integral yields 

fOO 

JO 

U sing Laplac e's method, we find that the second integral is equivalent to 
V^27r/(6tV3)^ proving that ^{P > t decays like ti/*^exp(-3t2/3/2) when t -^ 

+00. Note that the density of (3, which is g{f3) = f^ exp{—^y'^ — ^) -^, has 
a singularity at /3 = 0. 

APPENDIX B: MAXIMUM LIKELIHOOD FOR THE COMPLETE 

MODELS 

The M-step in our models requires solving the equation Eg{S) = [S] 
where [S] is a prescribed value of the sufficient statistic (an empirical aver- 
age for complete observations, or what we have denoted Sj in the M-step of 
the learning algorithm). In the next sections we provide the expressions of 
S for the family of models we consider and give the corresponding solution 
of the maximum likelihood equations. Notice that these are closed- form ex- 
pressions, ensuring the simplicity of each iteration of the SAEM algorithm. 

B.l. Log-ICA and Lap-ICA models. For these models, the log-likelihood is 

2 

2 



E^(/3^)-i 



2a 



logC{a\A) 



i=i 

where ^(/3) = 21og(e'' + e~^) in the logistic case, and ^(/3) = |/3| in the Lapla- 
cian case. As customary, and to lighten the formulae, we let /J*^ = 1 and 
ao = /Xq) so that f3 and A have size d+ 1, and remove /Xq from the ex- 
pressions for this model and the following ones. We will also leave to the 
reader the easy modifications of the algorithms in the case of shifted models 
described in Section 2.9. 

The likelihood can be put in exponential form using the sufficient statis- 
tic S = {(3(3 , X/3 ) , from which the maximum likelihood estimator can be 
deduced using 

rA = [X/3^]([/3/3^])-i, 

\a^ = i([|X|2] - 2(A, [KP^])f + (A^A, [P(3^])p = [|X - A(3\^]/d), 

where (•, ■)f refers to the Frobenius dot product between matrices (the sum 
of products of coefficients). 
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B.2. EG-ICA model. The likelihood is 






j=i 



j=i 



1 

2^ 






logCia^A) 



with sufficient statistic S = {(3(3 ,X/3 ) with (3^ = s^Y^. The maximum 
likelihood then is 

a2 = [|X-A/3|2]/(i. 

B.3. IFA model. The complete log-likelihood of the Independent Factor 
Analysis model for a single observation X is 

p 2 



1 
2^ 



,m,w 



^^(/3^-&'m,.)2 + ^logt/;,.-logC(A,a,] 

This formulation leads to the following sufficient statistics: 

S = [ 5o = ^ li^.=fc, 5i = f] lt^.=fc^/3^ /3/3^, X/3^ J . 
V i=i i=i / 

The estimator associated to averaged values of these statistics (denoted 
as above with brackets) is 

(A = [Xp^]{[(3(3^])-\ 
a2 = [|X-A/3|2]/d, 

"ifc = [-5*1] /[So], 
_ Wk = [So] /p. 

For this model, it is also possible to compute the conditional distribution 
of the hidden variables, (3, t and b given observed values of X [Attias (1999)] . 

Indeed, for given b and t, let /ib,t = [b^rn^i ,... , iPmtv). Let A = (Mrp H 3-) 

and, for a given X, /ib,t,x = A(^^X + /ib,t)- Then, a rewriting of the likeli- 
hood above shows that the conditional distribution of /3 given X, T and b 
is Gaussian with mean /ib,t,x and covariance A, and that the conditional 
distribution of (t,b) is 



7r(t,b|X) oc expj'-^d/ib.tP - (^^^ + /Ub,t)^A(^^^ + /Ub,t))) n 



WtJ- 



Using these expressions, the -E-step of the EM-algorithm can be computed 
exactly, but it requires computing all {2K + 1Y conditional probabilities 7r(t, 
b|X), which becomes intractable for large dimensions. In contrast, each 
step of the SAEM algorithm only requires sampling from the conditional 
distributions, and has complexity of order p{2K + 1). 
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The same remark on the feasibihty of the EM algorithm holds for all our 
models with discrete variables (BG-ICA, ET-ICA, etc.), for which the E- 
step of the algorithm can be made explicit by conditioning on the discrete 
variables, with a cost that grows exponentially in the number of components, 
whereas the sampling part of SAEM only grows linearly. 

B.4. BG-ICA and EBG-ICA models. These two models have the same 
parameters and maximize the same function. The likelihood is 

2 

-logC{a^,A,fi,a) 









with sufficient statistic S = {PP^, X/j"^, u) with /S^ = 1>'Y^ and u = b^ +■■■ +bP. 
The optimal parameters are 

A = [X/3^]([/3/3^])-i, 
a2 = [|X-A/3|2]/d, 
a=[u]/p. 

B.5. ET-ICA, TE-ICA and TEofT-ICA models. We turn to the ternary 
models which share the same parameters (up to /io for the offset model). 
The likelihood to maximize is 

rt 2 



log 



7 



1-7 



Ei^^i 



1 

2^ 



logC(c72,A,7) 



with sufficient statistic S = (/3/3^,X/3^,C), P^ = s^Y^, C = \Y^\ + ■■■ + \YP\. 
The optimal parameters are 

A = [X/3^]([/3/3^])-i, 

CT2 = [|X-A/3|2]/d, 

The maximum likelihood estimator for the single scale model is given by the 
same formulae, using /?■' = sY^ . 

SUPPLEMENTARY MATERIAL 

Supplement to "A stochastic algorithm for probabilistic independent com- 
ponent analysis" (DOI: 10.1214/11-AOAS499SUPP; .pdf). This file presents 
a larger version of some of the images contained in this paper. 
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